fix[next]: Fix segfault for nanobind >=2.10 #2431
tehrengruber wants to merge 7 commits into GridTools:main
Others also seem to be running into this issue unless they install with pinned versions, e.g. from
Pull request overview
This pull request implements a workaround for a segfault issue that occurs with nanobind version 2.10 and later. The root cause is that nanobind extension modules are being garbage collected while their functions are still in use, leading to crashes when calling those functions with ndarray arguments. The PR addresses this by creating a wrapper class that holds references to both the module and the function, preventing premature garbage collection.
Changes:
- Added import of `cast` from `typing` to support type casting
- Modified the compilation process to wrap the compiled function in a dynamically created class that maintains references to the extension module
- Added detailed comments explaining the workaround and linking to the upstream nanobind issue
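A minimal sketch of the kind of wrapper described in this overview, not the PR's actual code: the names `wrap_with_module_ref`, `ModuleKeepAlive`, `fake_ext` and `entry` are hypothetical, and a plain `types.ModuleType` stands in for the compiled nanobind extension module.

```python
import gc
import types
import weakref
from typing import Any, Callable, cast


def wrap_with_module_ref(module: types.ModuleType, entry_point_name: str) -> Callable[..., Any]:
    """Return a callable that keeps `module` alive for as long as it exists."""
    func = getattr(module, entry_point_name)

    def __call__(self: Any, *args: Any, **kwargs: Any) -> Any:
        return self._func(*args, **kwargs)

    # Dynamically created class (hypothetical name) whose attributes hold
    # strong references to both the module and the function.
    keep_alive_cls = type(
        "ModuleKeepAlive",
        (),
        {"_module": module, "_func": staticmethod(func), "__call__": __call__},
    )
    return cast(Callable[..., Any], keep_alive_cls())


# Stand-in for the extension module; it is not registered in sys.modules.
fake_module = types.ModuleType("fake_ext")
fake_module.entry = lambda x: x + 1

module_ref = weakref.ref(fake_module)
wrapped = wrap_with_module_ref(fake_module, "entry")

del fake_module  # drop the only direct reference to the module
gc.collect()

module_still_alive = module_ref() is not None  # the wrapper class holds it
result = wrapped(2)
```

The module stays reachable through the wrapper's class attribute, so it can only be collected once the wrapper itself is gone.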
Temporary fix for a segfault that occurs with at least nanobind 2.10.2. The exact environmental conditions that trigger it are not clear yet, but at first sight this looks like a nanobind issue. I could reproduce the error with Python 3.10 and 3.12.
Error message:
Source of the error:
This innocent-looking line in `src/gt4py/next/otf/compilation/compiler.py` triggers the error:
After the `getattr` call, the module `importer.import_from_path(src_dir / new_data.module)` is garbage collected, resulting in a call to `nanobind::detail::nb_module_clear`. This function (for unknown reasons) garbage collects the value stored in `static_pyobjects[pyobj_name::dl_version_tpl]`, which is used in `nanobind/include/nanobind/ndarray.h` when calling the compiled program in `src/gt4py/next/program_processors/runners/gtfn.py`.
Steps to debug:
The proper solution is likely for each function in nanobind to keep a reference to the module.
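The lifetime problem can be illustrated in pure Python, without nanobind. This is a sketch under assumptions: `load_module`, `fake_ext` and `entry` are hypothetical names, a plain `types.ModuleType` stands in for the imported extension module, and the behavior shown relies on CPython's reference counting. It only shows that the module object is collected while the function obtained via `getattr` stays usable; it does not reproduce the segfault.

```python
import gc
import types
import weakref


def load_module(name: str = "fake_ext") -> types.ModuleType:
    # Stand-in for importing a freshly compiled module from a path;
    # the module is deliberately not registered in sys.modules.
    module = types.ModuleType(name)
    module.entry = lambda x: x + 1
    return module


module = load_module()
module_ref = weakref.ref(module)

# Only the function is kept, mirroring the getattr call on a temporary module.
func = getattr(module, "entry")

del module  # the last strong reference to the module goes away
gc.collect()

# The function does not hold a reference to the module object,
# so the module can be collected while the function remains callable.
module_collected = module_ref() is None
value = func(1)
```

A pure-Python function survives this because it owns everything it needs. The description above suggests that a nanobind function additionally depends on state that is cleared together with its module.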
Update: we opened an issue in the nanobind repo reporting this bug: wjakob/nanobind#1283